Abstract

In this paper we demonstrate the applicability of latent Dirichlet allocation (LDA) for 
classifying large Web document collections. One of our main results is a novel influence 
model that gives a fully generative model of the document content taking linkage into 
account. In our setup, topics propagate along links in such a way that linked documents 
directly influence the words in the linking document. As another main contribution we 
develop LDA specific boosting of Gibbs samplers resulting in a significant speedup in our 
experiments. The inferred LDA model can be applied for classification as dimensionality 
reduction similarly to latent semantic indexing. In addition, the model yields link weights 
that can be applied in algorithms to process the Web graph; as an example we deploy LDA 
link weights in stacked graphical learning. By using Weka's BayesNet classifier, in terms of 
the AUC of classification, we achieve 4% improvement over plain LDA with BayesNet and 
18% over tf.idf with SVM. Our Gibbs sampling strategies yield about 5-10 times speedup 
with less than 1% decrease in accuracy in terms of likelihood and AUC of classification. 

Keywords: Web document classification, latent Dirichlet allocation, topic dis- 
tribution 

1 Introduction 

In this paper we demonstrate the applicability of latent Dirichlet allocation [5], a computation-
ally challenging but very powerful generative model, for large scale Web document classification
relying on hyperlinkage in addition to text content. Web content classification is a research
area that abounds with opportunities for practical solutions. The performance of most tradi- 
tional machine learning methods is limited by their disregard for the interconnection structure 
between web data instances (nodes). At the same time, relational machine learning methods 
often do not scale to web-sized data sets and, prior to our result, LDA models in general, and in
particular those that leverage the link structure, were thought to require an infeasibly large
amount of resources on the Web scale.

We apply one of the most successful generative topic models, latent Dirichlet allocation 
(LDA) developed by Blei, Ng and Jordan [5] for Web site classification. Generative topic 
models [TTJ [T51 [5] have a wide range of applications [HI [301 131 HH OH] in the fields of language 
processing, text mining and information retrieval, including categorization, keyword extraction, 
similarity search and statistical language modeling. An LDA model consists of latent topics 
described by distributions over vocabulary terms, and every term occurrence arises based on 
the topic distribution corresponding to the document in question. As a starting point of our 
results, we may use latent topics for dimensionality reduction prior to classification, as already
suggested in [5] but little explored since.

Recently several models have extended LDA to exploit links between web documents or scientific
papers [13, 12, 21]. In these models the term and topic distributions may be modified along
the links. All these models have the drawback that every document is treated as either citing
or cited, in other words, the citation graph is bipartite, and influence flows only from cited
documents to citing ones.

In this paper we develop the linked LDA model, in which each document can cite to and 
be cited by others and thus be influenced and influence other documents. Linked LDA is very 
similar to the copycat model of Dietz, Bickel and Scheffer [12] with the main difference that 
in our case the citation graph is not restricted to be bipartite. This fact and its consequences 
are the main advantage of our model, namely, that the citation graph is homogeneous, and so 
one does not have to take two copies, citing and cited, of every document. In addition we give 
a flexible model of all possible effects, including cross-topic relations and link selection. As an 
example, we may model the fact that topics Business and Computer are closer to one another 
than to Health as well as the distinction between topically related and unrelated links such 
as links to software to view the content over a Health site. The model may also distinguish 
between sites with strong, weak or even no influence from their neighbors. The linked LDA model
is described in full detail in Section 2.2.

We demonstrate the applicability of linked LDA for text categorization, an application
that is explicitly mentioned in [5] but, to our best knowledge, was justified prior to our work only
in special applications [4]. The inferred topic distributions of documents are used as features
to classify the documents into categories. In the linked LDA model a weight is inferred for 
every link. In order to validate the applicability of these edge weights, we show that their usage 
improves the performance of stacked graphical classification, a meta-learning scheme introduced 
in [19]. 

The crux in the scalability of LDA for large corpora lies in the understanding of the collapsed 
Gibbs sampler for inference. In the first application of the Gibbs sampler to LDA [16] as well as 
in the fast collapsed Gibbs sampler [23] the unit of sampling or, in other terms, a transition step 
of the underlying Markov chain, is the redrawing of one sample for a single term occurrence. 
The storage space and update time of all these counters prohibit sampling for very large corpora. 
Since however the order of sampling is neutral, we may group occurrences of the same term 
in one document together. Our main idea is then to re-sample each of these term positions in 
one step and assign a joint storage for them. We introduce three strategies: for aggregated 
sampling we store a sample for each position as before but update all of them in one step for 
a word, for limit sampling we update a topic distribution for each distinct word instead of 
drawing a sample, while for sparse sampling we randomly skip some of these words. All of 
these methods result in a significant speedup of about 5-10 times, with less than 1% decrease 
in accuracy in terms of likelihood and AUC of classification. The largest corpus where we could 
successfully perform classification using these boostings consisted of 100k documents (that is 
web sites with a total of 12M pages), and altogether 1.8G term positions. 

To assess the prediction power of the proposed features, we run experiments on a host-level 
aggregation of the .uk domain, which is publicly available through the Web Spam Challenge [6].
We perform topical classification into one of 11 top-level categories of the Open Directory
(http://dmoz.org). Our techniques are evaluated along several alternatives and, in terms of
the AUC measure, yield the improvement of 4% over plain LDA and 18% over tf.idf with SVM 
(here BayesNet is used on linked LDA based features). 

The rest of the paper is organized as follows. Section 2 reviews the main concepts of LDA
and then introduces our linked LDA model. Section 3 describes the experimental setup and
Section 4 the results.

1.1 Related results 

The use of latent topics in information retrieval tasks starts with latent semantic indexing 
(Deerwester, Dumais, Landauer, Furnas, Harshman [11]), a method that represents documents
in a low rank approximation of the term space. Probabilistic latent semantic analysis (PLSA,
Hofmann [18]) extends this idea by defining a generative model over the latent topics that yields
a term distribution for each document. As the starting point of our results, latent Dirichlet 
allocation (LDA, Blei, Ng, Jordan [5]) introduces additional metaparameters and sampling from 
Dirichlet distributions to yield a model with astonishing performance in various tasks. 

We compare our linked LDA model to preexisting extensions of PLSA and LDA that jointly 
model text and link as well as the influence of topics along links. The first such model is PHITS 
defined by Cohn and Hofmann [9]. The mixed membership model of Erosheva, Fienberg and
Lafferty [13] can be thought of as an LDA based version of PHITS. Common to these models is 
the idea to infer similar topics to documents that are jointly similar in their bag of words and 
link adjacency vectors. Later, several similar link based LDA models were introduced, including 
the copycat model, the citation influence model by Dietz, Bickel and Scheffer [12] and the link-
PLSA-LDA and pairwise-link-LDA models by Nallapati, Ahmed, Xing and Cohen [21]. These
results extend LDA over a bipartition of the corpus into citing and cited documents such that
influence flows along links from cited to citing documents. They are shown to outperform earlier
methods [12, 21]. The copycat model is very similar to linked LDA, with the only difference
that in the former every document d is duplicated into a citing and a cited copy, the topics in
the citing copy are drawn from d's topic distribution, while those in the cited copy are drawn
from a cited document's topic distribution. In contrast, in linked LDA, every topic is drawn either
from d's or a cited document's topic distribution. The citation influence model is a finer version of
the copycat model, in that there the citing copy's topics are drawn either from d's or a cited
document's topic distribution. The link-PLSA-LDA and pairwise-link-LDA models differ from
these in that they generate the links. We make comparisons to the link-PLSA-LDA bipartite 
model in this paper. 

While these four models generate topical relation for hyperlinked documents, in a homoge- 
neous corpus one has to duplicate each document and infer two models for them. This is in 
contrast to the linked LDA model introduced in this paper whose main advantage is that it 
treats citing and cited documents identically, and no duplication is needed. 

As a completely different direction for link based LDA models, we mention the results 
[32, 33, 26] which give a generative model for the links of a network, with no words at the
nodes. 

We also compare the performance of our results to general classifiers aided by the Web 
hyperlinks. Relational learning methods (presented, for instance, in [15]) also consider existing
relationships between data instances. The first relational learning method designed for topical 
web classification was proposed by Chakrabarti, Dom and Indyk [8] and improved by Angelova
and Weikum [1]. Several subsequent results [24 and the references therein] confirm that clas-
sification performance can be significantly improved by taking into account the labels assigned
to neighboring nodes. In our baseline experiments we use the most accurate hypertext classi-
fiers [7] obtained by stacked graphical learning, a meta-learning scheme introduced by Kou and
Cohen [19]. In stacked graphical learning, first a base learner is applied to the training data
to produce initial predictions. Then the set of features is expanded by adding the predictions 
of related instances from the first step. Finally, the base learner is re-applied to the expanded 
feature set, resulting in a stacked model. Performance of stacked graphical learning is evaluated 
in Section 4.3 both with various graph based edge weights [10] and with those inferred by our
linked LDA model. 

Another main contribution of this paper is three new efficient inference methods. Upon
introducing LDA, Blei, Ng and Jordan [5] proposed a variational algorithm for inference. Later,
several other methods were described for inference in LDA, namely collapsed Gibbs sampling
[16], expectation propagation, and collapsed variational inference [27]. Besides, [22] suggests
methods for applying Gibbs sampling in a parallel environment.

Among the above collapsed Gibbs sampling methods, fastest convergence is achieved by 
that of Griffiths and Steyvers [16]. To our best knowledge, prior to our result there has been 
one type of attempt to speed up LDA inference in general, and LDA-based Gibbs sampling in 
particular. Porteous, Newman, Ihler, Asuncion, Smyth and Welling [23] modify Gibbs sampling
such that it gives the same distribution by using a search data structure for sample updates, and
call it the fast Gibbs sampler. They show significant speedup for large topic numbers.



2 LDA models and Gibbs sampling 
2.1 Background 

In order to prepare the necessary background and notation for our linked LDA model in the 
next subsection, we shortly describe the Gibbs sampling method for latent Dirichlet allocation 
[5]. For a detailed elaboration we refer to Heinrich [17]. We have a vocabulary V consisting
of terms, a set T of k topics and m documents of arbitrary length. For every topic $z \in T$ a
distribution $\varphi_z$ on V is sampled from $\mathrm{Dir}(\beta)$, where $\beta \in \mathbb{R}_+^{V}$ is a positive smoothing
parameter. Similarly, for every document d a distribution $\vartheta_d$ on T is sampled from
$\mathrm{Dir}(\alpha)$, where $\alpha \in \mathbb{R}_+^{T}$ is a positive smoothing parameter.

The words of the documents are drawn as follows: for every word position of document d a
topic z is drawn from $\vartheta_d$, and then a term is drawn from $\varphi_z$ and filled into that position. The
notation is summarized in the widely used Bayesian network representation of LDA in Figure 1.
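
The generative process just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the smoothing parameters are taken to be constant scalars, and the function names (`sample_dirichlet`, `generate_lda_corpus`) are hypothetical, not part of any implementation referred to in this paper.

```python
import random


def sample_dirichlet(params, rng):
    """Draw one sample from Dir(params) by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [x / total for x in draws]


def sample_index(probs, rng):
    """Draw an index according to the discrete distribution probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_lda_corpus(doc_lengths, vocab_size, k, alpha, beta, seed=0):
    """Generate documents (lists of term ids) following the LDA process."""
    rng = random.Random(seed)
    # phi[z]: term distribution of topic z, sampled from Dir(beta)
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(k)]
    docs = []
    for length in doc_lengths:
        # theta: topic distribution of this document, sampled from Dir(alpha)
        theta = sample_dirichlet([alpha] * k, rng)
        words = []
        for _ in range(length):
            z = sample_index(theta, rng)             # topic of this position
            words.append(sample_index(phi[z], rng))  # term drawn from topic z
        docs.append(words)
    return docs, phi
```

Inference runs in the opposite direction: only the words are observed, and the topic assignments and distributions have to be recovered.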



In this paper we use Gibbs sampling [16] for LDA model inference. Gibbs sampling is a
Markov chain Monte Carlo algorithm for sampling from a joint distribution $p(x)$, $x \in \mathbb{R}^n$, if all
conditional distributions $p(x_i \mid x_{-i})$ are known ($x_{-i} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$). In LDA the
goal is to estimate the distribution $p(z \mid w)$ for $z \in T^P$, $w \in V^P$, where P denotes the set of word
positions in the documents; hence Gibbs sampling makes use of the values $p(z_i = z' \mid z_{-i}, w)$ for
$i \in P$. In the initialization step a random topic assignment $z_i$, $i \in P$ is chosen.

Figure 1: LDA as a Bayesian network

Gibbs sampling for LDA has an efficiently computable closed form, as deduced for example
in [17]. Before describing the formula, we introduce the usual notation. We let d be a document
and $w_i$ its word at position i. We also let the count $N_{dz}$ be the number of words in d with topic
assignment z, $N_{zw}$ be the number of occurrences of word w in the whole corpus with topic
assignment z, $N_d$ be the length of document d and $N_z$ be the number of all words with topic
assignment z. A superscript $N^{-i}$ denotes that position i is excluded from the corpus when
computing the corresponding count. Now the Gibbs sampling formula becomes

$$p(z_i = z' \mid z_{-i}, w) \propto \frac{N^{-i}_{z'w_i} + \beta(w_i)}{N^{-i}_{z'} + \sum_{w \in V} \beta(w)} \cdot \frac{N^{-i}_{dz'} + \alpha(z')}{N^{-i}_{d} + \sum_{z \in T} \alpha(z)}. \qquad (1)$$
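
As an illustration of how Equation (1) drives the sampler, the following minimal Python sketch performs plain collapsed Gibbs sampling with one redraw per word position. It assumes constant scalar smoothing parameters `alpha` and `beta`; the function name `gibbs_lda` and the data layout (documents as lists of term ids) are hypothetical choices for this sketch and do not describe the C++ code used in the experiments.

```python
import random


def gibbs_lda(docs, vocab_size, k, alpha, beta, iterations, seed=0):
    """Plain collapsed Gibbs sampling for LDA over lists of term ids."""
    rng = random.Random(seed)
    n_dz = [[0] * k for _ in docs]               # N_dz
    n_zw = [[0] * vocab_size for _ in range(k)]  # N_zw
    n_z = [0] * k                                # N_z
    z_assign = []
    for d, words in enumerate(docs):             # random initialization
        topics = []
        for w in words:
            z = rng.randrange(k)
            topics.append(z)
            n_dz[d][z] += 1
            n_zw[z][w] += 1
            n_z[z] += 1
        z_assign.append(topics)
    for _ in range(iterations):
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                z = z_assign[d][i]
                # Exclude position i from the counts (the "-i" superscript).
                n_dz[d][z] -= 1
                n_zw[z][w] -= 1
                n_z[z] -= 1
                # Unnormalized conditional; the document-length denominator
                # is the same for every topic, so it is dropped.
                weights = [
                    (n_zw[t][w] + beta) / (n_z[t] + vocab_size * beta)
                    * (n_dz[d][t] + alpha)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights, k=1)[0]
                z_assign[d][i] = z
                n_dz[d][z] += 1
                n_zw[z][w] += 1
                n_z[z] += 1
    return z_assign, n_dz, n_zw, n_z
```

Each sweep recomputes the conditional distribution once per word position, which is the cost that becomes prohibitive for very large corpora.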



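After a sufficient number of iterations we stop with the current topic assignment sample z.
From z, the variables $\varphi$ and $\vartheta$ are estimated as

$$\varphi_z(w) = \frac{N_{zw} + \beta(w)}{N_z + \sum_{w' \in V} \beta(w')} \qquad (2)$$

and

$$\vartheta_d(z) = \frac{N_{dz} + \alpha(z)}{N_d + \sum_{z' \in T} \alpha(z')}. \qquad (3)$$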

The likelihood of an inferred LDA model on a set P of word positions in a collection of
held-out documents is

$$\prod_{i \in P} p(w_i)^{-1/|P|}, \quad \text{where} \quad p(w_i) = \sum_{z \in T} \varphi_z(w_i)\, \vartheta_d(z), \qquad (4)$$

where d is the document containing position i. Here $\vartheta_d$ is a MAP estimate of the topic distribution
of the document, and is usually approximated by unseen inference.
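
The smoothed estimates and the per-position probability $p(w_i)$ can be computed directly from the count tables. The sketch below assumes constant scalar smoothing parameters and uses hypothetical helper names; it returns the average of $\log p(w_i)$, of which the aggregate held-out measure is a monotone function.

```python
import math


def estimate_phi(n_zw, n_z, beta):
    """Smoothed term distribution of every topic from the counts N_zw, N_z."""
    vocab_size = len(n_zw[0])
    return [
        [(n_zw[z][w] + beta) / (n_z[z] + vocab_size * beta)
         for w in range(vocab_size)]
        for z in range(len(n_zw))
    ]


def estimate_theta(n_dz_row, alpha):
    """Smoothed topic distribution of one document from its counts N_dz."""
    k = len(n_dz_row)
    total = sum(n_dz_row)
    return [(n + alpha) / (total + k * alpha) for n in n_dz_row]


def mean_log_prob(words, phi, theta):
    """Average of log p(w_i), with p(w_i) = sum_z phi_z(w_i) * theta_d(z)."""
    logs = [
        math.log(sum(phi[z][w] * theta[z] for z in range(len(theta))))
        for w in words
    ]
    return sum(logs) / len(logs)
```

Working with the mean log-probability avoids numerical underflow, which a direct product over millions of positions would cause.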

2.2 Linked LDA 

Next we extend latent Dirichlet allocation to model the effect of a hyperlink between two 
documents on topic and term distributions. The key idea, summarized as a Bayes net in Figure
2, is to modify the topic distribution of a position on the word plate based on a link from the
current document on the document plate. For each position we select either an outlink or the 
document itself to modify the topic distribution of the original LDA model. 

Formally we introduce linked LDA over the notations of the previous subsection for vocabu-
lary V, the k-element topic set T and the document set D. Links are represented by a directed
graph with inlinks for cited and outlinks for citing documents. Our model also relies on the LDA
distributions $\varphi_z$ and $\vartheta_d$. We introduce an additional distribution $\chi_d$ on the set $S_d = \{d$ and its
outneighbors$\}$ for every document d, sampled from $\mathrm{Dir}(\gamma_d)$, where $\gamma_d$ is a positive smoothing
vector on $S_d$.

As also seen in the Bayes net of Figure 2, the words of the documents are drawn as follows.
For every word position i of document d, we

• draw an influencing document $r \in S_d$ from $\chi_d$,

• draw a topic z from $\vartheta_r$ (instead of $\vartheta_d$ as in LDA),

• draw a term from $\varphi_z$ and fill it into the position.
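
The three draws above can be sketched as follows for a single document. The distributions are assumed to be given; the function name and the argument layout (`theta` indexed by document id, `phi` indexed by topic) are hypothetical choices made for this illustration.

```python
import random


def generate_linked_document(length, s_d, chi_d, theta, phi, seed=0):
    """Generate (r, z, w) triples for one document under linked LDA.

    s_d    candidate influencing documents (the document and its outneighbors)
    chi_d  distribution over s_d
    theta  theta[r] is the topic distribution of document r
    phi    phi[z] is the term distribution of topic z
    """
    rng = random.Random(seed)
    positions = []
    for _ in range(length):
        # influencing document, then topic from ITS distribution, then term
        r = rng.choices(s_d, weights=chi_d, k=1)[0]
        z = rng.choices(range(len(theta[r])), weights=theta[r], k=1)[0]
        w = rng.choices(range(len(phi[z])), weights=phi[z], k=1)[0]
        positions.append((r, z, w))
    return positions
```

The only change relative to plain LDA is the first draw: the topic comes from the distribution of the selected document r, which may be the document itself.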



Figure 2: Linked LDA as a Bayesian network



Note that for the sake of a unified treatment, d itself can be an influencing document of itself.
This is in contrast to the citation influence model of [12], where for every word a Bernoulli draw
decides whether in the citing copy the influencing document is d itself or an outneighbor of it.

We describe the Gibbs sampling inference procedure for linked LDA. Naturally, here $N_{dz}$
denotes the number of words with topic assignment z influenced by document d, and similarly
for $N_{zw}$, $N_d$ and $N_z$. Note that this document d is not necessarily the one containing word
w, it can be an outneighbor as well. The goal is to estimate the distribution $p(r, z \mid w)$ for
$r \in D^P$, $z \in T^P$, $w \in V^P$, where P denotes the set of word positions in the documents. In
Gibbs sampling one has to calculate $p(z_i = z', r_i = r' \mid z_{-i}, r_{-i}, w)$ for $i \in P$. First, it can be
shown that

$$p(r, z, w \mid \alpha, \beta, \gamma) = \prod_{d \in D} \frac{\Delta(\vec{M}_d + \gamma_d)}{\Delta(\gamma_d)} \cdot \prod_{d \in D} \frac{\Delta(\vec{N}_d + \alpha)}{\Delta(\alpha)} \cdot \prod_{z \in T} \frac{\Delta(\vec{N}_z + \beta)}{\Delta(\beta)}. \qquad (5)$$

Here $\vec{M}_d$ is the count vector of influencing documents appearing at the words of document d,
$\vec{N}_d$ is the count vector of observed topics influenced by d, and $\vec{N}_z$ is the count vector of words
with topic z.

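By division, (5) gives an iteration in the Gibbs sampling as follows:

$$p(z_i = z', r_i = r' \mid z_{-i}, r_{-i}, w) = \frac{p(z, r, w)}{p(z_{-i}, r_{-i}, w_{-i})} \cdot \frac{1}{p(w_i \mid z_{-i}, r_{-i}, w_{-i})}, \qquad (6)$$

thus

$$p(z_i = z', r_i = r' \mid z_{-i}, r_{-i}, w) \propto \frac{p(z, r, w)}{p(z_{-i}, r_{-i}, w_{-i})} = \frac{N^{-i}_{z'w_i} + \beta(w_i)}{N^{-i}_{z'} + \sum_{w \in V} \beta(w)} \cdot \frac{N^{-i}_{r'z'} + \alpha(z')}{N^{-i}_{r'} + \sum_{z \in T} \alpha(z)} \cdot \frac{M^{-i}_{dr'} + \gamma_d(r')}{\sum_{r \in S_d} \left( M^{-i}_{dr} + \gamma_d(r) \right)}. \qquad (7)$$

Here $M_{dr}$ denotes the number of positions in document d influenced by outneighbor $r \in S_d$.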
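
A single joint redraw of the pair (topic, influencing document) for one word position might be sketched as below. The count tables, their names and the scalar `alpha`, `beta` are assumptions of this illustration; the factor depending only on the total number of positions in d is constant over all candidate pairs and is therefore omitted from the weights.

```python
import random


def resample_position(d, w, z_old, r_old, s_d, counts, alpha, beta,
                      gamma_d, rng):
    """Jointly redraw (z, r) for one position of document d holding term w."""
    n_rz, n_r, n_zw, n_z, m_dr = counts
    k, vocab_size = len(n_z), len(n_zw[0])
    # Exclude the current position from every count (the "-i" superscript).
    n_rz[r_old][z_old] -= 1
    n_r[r_old] -= 1
    n_zw[z_old][w] -= 1
    n_z[z_old] -= 1
    m_dr[d][r_old] -= 1
    pairs, weights = [], []
    for r in s_d:
        link = m_dr[d].get(r, 0) + gamma_d[r]
        for z in range(k):
            term = (n_zw[z][w] + beta) / (n_z[z] + vocab_size * beta)
            topic = (n_rz[r][z] + alpha) / (n_r[r] + k * alpha)
            pairs.append((z, r))
            weights.append(term * topic * link)
    z_new, r_new = rng.choices(pairs, weights=weights, k=1)[0]
    # Put the position back with its new assignment.
    n_rz[r_new][z_new] += 1
    n_r[r_new] += 1
    n_zw[z_new][w] += 1
    n_z[z_new] += 1
    m_dr[d][r_new] = m_dr[d].get(r_new, 0) + 1
    return z_new, r_new
```

The number of candidate pairs is k times the size of the neighborhood, which is one reason to cap the number of outlinks kept per document.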

While this "two-coordinate" sampling is against the general Gibbs procedure that re-samples
one coordinate at a time, it has the required distribution $p(r, z \mid w)$ as its unique stationary dis-
tribution. This follows from the fact that it is a random walk over a finite state Markov chain
which is irreducible and aperiodic as $\alpha, \beta > 0$. Thus it has a unique stationary distribution,
which is necessarily the distribution $p(r, z \mid w)$ by construction.

Similarly to LDA, after a sufficient number of iterations we stop with the current topic
assignment sample z, and estimate $\varphi$ as in (2), $\vartheta$ as in (3), and $\chi$ by

$$\chi_d(r) = \frac{M_{dr} + \gamma_d(r)}{\sum_{r' \in S_d} \left( M_{dr'} + \gamma_d(r') \right)}. \qquad (8)$$

For an unseen document d, the distributions $\vartheta$ and $\chi$ can be estimated exactly as in (3) and
(8) once we have a sample from its word topic assignment z and word influencing document
assignment r. Similarly to LDA, unseen inference for linked LDA is the same as doing linked
LDA model inference for the whole corpus (train and test corpora), in such a way that the z
and r assignments for the training documents are kept throughout.

The likelihood of the inferred linked LDA model on a held-out corpus is calculated analo-
gously as for LDA in Equation (4):

$$\prod_{i \in P} p(w_i)^{-1/|P|}, \quad \text{where} \quad p(w_i) = \sum_{z \in T,\, r \in S_d} \varphi_z(w_i)\, \vartheta_r(z)\, \chi_d(r). \qquad (9)$$

2.3 Fast Gibbs sampling heuristics 

In this section we describe three strategies for faster inference for both LDA and linked LDA.
The methods modify the original Gibbs sampling procedure [16]. For simplicity, we describe
them for the plain LDA setting. The speedup obtained by these boostings is evaluated in
Section 4.1.

All of our methods start by sorting the words of the original documents so that sampling 
is performed subsequently for the occurrences of the same word. We introduce additional 
heuristics to compute the new samples for all occurrences of the same word in a document at 
once. 

In aggregated Gibbs sampling we calculate the conditional topic distribution F as in
Equation (1) for the first occurrence i of a word in a given document. Next we draw a topic
from F for every position with the same word without recalculating F and update all counts
corresponding to the same word. In this way the number of calculations of the conditional
topic distributions is the number of different terms in the document instead of the length of the
document; moreover, the space requirement remains unchanged. Thus the speedup is larger
if words have more repeated occurrences, which mostly happens for large corpora.
This performance can be further improved by maintaining the aggregated topic count vector
for terms with high frequency in the document, instead of storing the topic at each word position.
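
One sweep of aggregated sampling over a single document could look like the sketch below. It is one possible reading of the description: the conditional distribution is computed once per distinct term, with only the first occurrence excluded, and then reused for the remaining occurrences. Scalar `alpha` and `beta`, the grouping of occurrences in a dictionary and the function name are assumptions made for the illustration.

```python
import random


def aggregated_sweep(d, doc_topics, n_dz, n_zw, n_z, alpha, beta, rng):
    """One aggregated Gibbs sweep over document d.

    doc_topics maps each distinct term w of d to the list of topics currently
    assigned to its occurrences. Returns how many conditional distributions
    were computed, which equals the number of distinct terms.
    """
    k, vocab_size = len(n_z), len(n_zw[0])
    computed = 0
    for w, topics in doc_topics.items():
        dist = None
        for j in range(len(topics)):
            old = topics[j]
            n_dz[d][old] -= 1
            n_zw[old][w] -= 1
            n_z[old] -= 1
            if j == 0:
                # Conditional of Equation (1), computed once per term.
                dist = [
                    (n_zw[t][w] + beta) / (n_z[t] + vocab_size * beta)
                    * (n_dz[d][t] + alpha)
                    for t in range(k)
                ]
                computed += 1
            new = rng.choices(range(k), weights=dist, k=1)[0]
            topics[j] = new
            n_dz[d][new] += 1
            n_zw[new][w] += 1
            n_z[new] += 1
    return computed
```

For a document of length 4 with two distinct terms, the sweep computes two distributions instead of four; the saving grows with the repetition rate of terms.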

Limit Gibbs sampling heavily relies on the bag of words model assumption that the topic 
of a document remains unchanged by multiplying all term frequencies by a constant. In the 
limit hence we may maintain the calculated conditional topic distribution F for the set of all 
occurrences of a word, without drawing a topic for every single occurrence. Equation (1) can
be adapted for this setting by a straightforward redefinition of the N counts. 

It is easy to check that if the topic distributions for all positions are uniform then we get an
unstable fixed point of limit Gibbs sampling, provided both $\alpha$ and $\beta$ are constant. Clearly, with
large probability, these fixed points can be avoided by selecting biased initial topic distributions.
We never encountered such unstable fixed points during our experiments.

Similarly to aggregated Gibbs sampling, depending on the size and term frequency distri- 
bution of the documents, limit sampling may result in compressed space usage. 

Sparse sampling with sparsity parameter $\ell$ is a lazy version of limit Gibbs sampling where
we ignore some of the less frequent terms to achieve faster convergence on the more important
ones. On every document we sample doclength/$\ell$ times from a multinomial distribution on
the distinct terms with replacement, selecting a term with probability proportional to its
term frequency $tf_w$ in the document. Hence with $\ell = 1$ we expect a performance similar to
limit Gibbs sampling, while large $\ell$ results in a speedup of about a factor of $\ell$, with a trade-off
of lower accuracy and slower convergence. The idea of laziness can be naturally built upon
aggregated sampling alone, without limit sampling, and we will indeed evaluate this sampling
(called aggregated sparse) in Table 3 in Section 4.3.
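
The term selection step of sparse sampling is easy to isolate. The sketch below picks the terms to update in one sweep of a document; rounding doclength/$\ell$ to the nearest integer and drawing at least one term are assumptions of this illustration, as is the function name.

```python
import random
from collections import Counter


def select_terms_to_update(term_freqs, ell, rng):
    """Pick the terms of one document to update in a sparse sweep.

    term_freqs maps each distinct term to its frequency in the document.
    Terms are drawn with replacement, proportionally to their frequency.
    """
    terms = list(term_freqs)
    weights = [term_freqs[t] for t in terms]
    doc_length = sum(weights)
    draws = max(1, round(doc_length / ell))  # assumption: at least one draw
    return Counter(rng.choices(terms, weights=weights, k=draws))
```

For a document with term frequencies `{"a": 6, "b": 3, "c": 1}` and `ell = 5`, two terms are drawn per sweep, and the frequent term "a" is the most likely to be among them.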

Aggregated Gibbs sampling has the required distribution $p(z \mid w)$ as its unique stationary
distribution. Indeed, it is a random walk over a finite state Markov chain which is irreducible
and aperiodic as $\alpha, \beta > 0$, implying that it has a unique stationary distribution, which is
necessarily the distribution $p(z \mid w)$ by construction.

As for limit Gibbs sampling, we rely on the assumption that, in the bag of words model,
multiplying all term frequencies in the corpus by a constant changes the semantic meaning of the
documents only moderately. As limit Gibbs sampling arises as a limit of aggregated
Gibbs sampling by letting this constant tend to infinity, its stationary distribution is very close to
what the aggregated version samples, that is, the required $p(z \mid w)$. This argument is justified
by our measurements in Section 4.

Laziness clearly keeps the above arguments valid, thus aggregated sparse Gibbs sampling
has stationary distribution $p(z \mid w)$, and sparse sampling the same as limit.

3 Experimental setup 

3.1 The data set and classifiers 

In our experiments we use the 114k node host-level aggregation of the WEBSPAM-UK2007 .uk
domain crawl, which consists of 12.5M pages and is publicly available through the Web Spam
Challenge [6]. We perform topical classification into one of the 14 top-level English language
categories of the Open Directory (http://dmoz.org) while excluding category "World" con-
taining non-English language documents. If a site contains a page registered in DMOZ with
some top category, then we label it with that category. In case of a conflict we choose a random
page of the site registered in DMOZ and label its site with its top category. In this way we
could derive category labels for 32k documents.

We perform the usual data cleansing steps prior to classification. After stemming by
TreeTagger¹ and removing stop words by the Onix list², we aggregate the words appearing in all
HTML pages of the sites to form one document per site. We discard rare terms and keep the
100k most frequent to form our vocabulary. We discard all hosts that become empty, that is,
those that consisted solely of rare terms (probably only a few of them). To reduce unnecessary
computational load we also discard all unlabeled hosts with more than 100k remaining word
occurrences. Finally we weight directed links between hosts by their multiplicity. For every site
we keep at most 50 outlinks with largest weight.
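
The link weighting and pruning step can be illustrated as follows. The input format (one host pair per page-level link), the exclusion of links from a host to itself and the function name are assumptions of this sketch, not a description of the actual preprocessing code.

```python
import heapq
from collections import Counter


def top_outlinks(page_links, max_out=50):
    """Weight host-level links by multiplicity and keep the heaviest ones.

    page_links is an iterable of (source_host, target_host) pairs.
    Returns {source_host: {target_host: weight}} with at most max_out
    targets per source.
    """
    # assumption: links from a host to itself are ignored
    weights = Counter((s, t) for s, t in page_links if s != t)
    by_source = {}
    for (s, t), w in weights.items():
        by_source.setdefault(s, []).append((w, t))
    return {
        s: {t: w for w, t in heapq.nlargest(max_out, candidates)}
        for s, candidates in by_source.items()
    }
```

With `max_out=2`, a host that links three times to "b", twice to "c" and once to "d" keeps only its links to "b" and "c".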

We call this the big corpus, as it contains 1.8G positions, far more than the usual corpora 
on which LDA experiments are carried out. Since the big corpus is infeasible for the baseline 
experiments, we also form a small corpus consisting of the labeled hosts only, to be able to 
compare our results with the baseline. We use the most frequent 20k terms as vocabulary and 
keep only the 10 largest-weight outlinks for each host. Note that even the small corpus is large
enough to cause efficiency problems for the baseline classifiers. 

We chose 11 out of the 14 categories to apply classification to them, as the other 3 categories 
were very small in our corpus. In our experiments we perform two-class classification for all 
of these 11 big categories. We use the machine learning toolkit Weka [30] to apply SVM,
C4.5 decision tree and the Bayes net implementation of Weka, called BayesNet. For graph
stacking we used in-house Java code integrated into Weka. We use 10-fold cross validation
and measure performance by the average AUC over the 11 classes. Every run ((linked) LDA 
model build and classification) is repeated 10 times to get variance of the AUC classification 
performance. 

3.2 Baseline classifiers 

As the simplest baseline we use the tf.idf vectors with SVM for the small corpus, as it took a 
prohibitively long time to run it on the big corpus. Another baseline is to use the $\vartheta$ topic
distributions delivered by LDA as features with the classifier BayesNet.

As recently a large number of relational learning methods were invented, a complete com- 
parison is beyond the scope of this paper. Instead we concentrate on the stacked graphical 
learning method [19] that reaches best performance for classifying Web spam over this corpus 

m> 

The general stacked graphical procedure starts with one of the base learners of Subsection 3.1
that classifies each element v positive with weight p(v). Positive and negative instances in the
training set have p(v) equal to 1 and 0, respectively. These values are used in a classifier stacking
step to form new features f(u) based on certain p(v). Stacking can recursively be applied, hence
if in one step we consider p(v) for the neighbors of u, then in a two-layer stacking we gather
information from the distance two neighborhood.

We use cocitation to measure node similarities. The cocitation coc(u, v) is defined as the
number of common inneighbors of u and v. This measure turned out most effective for Web
spam classification [2]. We may use both the input directed graph and its transpose by changing
the direction of each link. We will refer to these variants as directed and reversed versions.
Notice that reversed cocitation denotes bibliographic coupling (nodes pointed to by both u and
v). Several other options to measure node similarities and form the neighborhood aggregate
features are explored in [2, 10].
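
Cocitation counts and a stacked feature built on them can be sketched as below. The cocitation-weighted average used for f(u) is only one possible aggregate, chosen here for illustration; both function names are hypothetical.

```python
from collections import defaultdict


def cocitation(edges):
    """coc[(u, v)]: number of common inneighbors of u and v.

    edges is an iterable of directed (source, target) pairs. Passing the
    edges with swapped direction yields bibliographic coupling instead.
    """
    out = defaultdict(set)
    for s, t in edges:
        out[s].add(t)
    coc = defaultdict(int)
    for targets in out.values():
        # every node pointing to both u and v is a common inneighbor
        for u in targets:
            for v in targets:
                if u != v:
                    coc[(u, v)] += 1
    return coc


def stacked_feature(u, coc, p):
    """Cocitation-weighted average of p(v) over nodes v similar to u."""
    num = den = 0.0
    for (a, v), weight in coc.items():
        if a == u and v in p:
            num += weight * p[v]
            den += weight
    return num / den if den else 0.0
```

The resulting value f(u) is appended to the feature vector of u before the base learner is applied again.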

1 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
2 http://www.lextek.com/manuals/onix/stopwords1.html

3.3 LDA inference



In our LDA inference the following parameter settings are used. The number of topics is chosen
to be k = 30. The Dirichlet parameter vector $\beta$ is constant 200/|V|, and $\alpha$ is constant 50/k.
For linked LDA we also consider directed links between the documents. For a document d, the
smoothing parameter vector $\gamma_d$ was chosen in such a way that

$$\gamma_d(c) \propto w(d \to c) \ \text{ for all } c \in S_d,\ c \neq d, \quad \text{and} \quad \gamma_d(d) \propto 1 + \sum_{c \in S_d,\, c \neq d} w(d \to c),$$

such that $\sum_{c \in S_d} \gamma_d(c) = |d|/p$, where |d| is the number of word positions in d (the document
length), and $w(d \to c)$ denotes the multiplicity of the $d \to c$ link in the corpus. As a quick
parameter sweep, we tried three values p = 1, 4, 10 with plain Gibbs sampler and BayesNet
classifier. The accuracy was 0.835, 0.852 and 0.854, respectively, so we chose p = 10 in the
subsequent experiments.
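
The construction of the smoothing vector for one document amounts to a proportional rescaling, which the following sketch makes explicit. The function name and the dictionary representation of the link weights are assumptions of this illustration.

```python
def gamma_vector(d, out_weights, doc_length, p):
    """Smoothing vector over S_d = {d and its outneighbors}.

    out_weights maps each outneighbor c of d to the multiplicity w(d -> c).
    The entries are proportional to the link weights (and to one plus their
    sum for d itself) and are scaled to sum to doc_length / p.
    """
    raw = dict(out_weights)
    raw[d] = 1 + sum(out_weights.values())
    scale = (doc_length / p) / sum(raw.values())
    return {c: scale * value for c, value in raw.items()}
```

For a document of length 90 with link multiplicities 3 and 1 and p = 10, the unscaled entries 3, 1 and 5 already sum to 9 = 90/10, so they are returned unchanged; the document itself always receives slightly more than half of the total mass.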

We take the $\vartheta$ topic distribution vectors as features for the documents and use the classifiers
of Subsection 3.1. For Gibbs sampling in LDA and linked LDA we apply the baseline as well
as the aggregated, limit and sparse heuristics. Altogether this results in eight different classes
of experiments.

As another independent experiment we also measure the quality of the inferred edge weight
function $\chi$. This weight can be used in the stacked graphical classification procedure of Sec-
tion 3.2 over the $\vartheta$ topic distribution feature and the tf.idf baseline classifiers, for both the link
graph and its reversed version.

The same experiments are performed with the link-PLSA-LDA model [21], using the C-code
provided to us by Ramesh Nallapati. We also compared the running time of our Gibbs sampling
strategies with the fast Gibbs sampling method of [23], using the C-code referred to therein³.
For results, see Subsection 4.1.

We developed our own C++ code for LDA and linked LDA containing plain Gibbs sampling
and the three Gibbs sampling boostings proposed in this paper. This code is publicly available⁴
together with the DMOZ labels used for the .uk sites. The computations were run on Linux
machines with 50GB RAM and multicore 1.8GHz AMD Opteron processors with 1MB cache.



4 Results 

4.1 Speedup with the Gibbs sampling strategies 

Applying aggregated, limit and sparse Gibbs samplings results in an astonishing speedup, see
Tables 1 and 2. Experiments were carried out on the small corpus, and the models were run with
k = 30 topics.

Observe that the speedup of aggregated Gibbs sampling is striking, and this does not at all
come at the expense of lower accuracy. Indeed, results of Subsections 4.2 and 4.3 show less than
1% decrease in terms of likelihood and AUC value after 50 iterations when using aggregated
Gibbs sampling. The advantage of limit sampling over aggregated sampling (no need to draw
from the distribution) turns out to be negligible. For sparse Gibbs sampling the speedup is a
straightforward consequence of the fact that we skip many terms during sampling, and thus
the running time is approximately a linear function of $1/\ell$, where $\ell$ is the sparsity parameter.
The constant term in this linear function is apparently approximately 75 sec for LDA and linked
LDA; certainly, this is the time needed to iterate through the 16G memory. As for the choice
$\ell = 10$, note that the running time of one iteration for LDA is only 9% of the one with plain
Gibbs sampling, and still, the accuracy measured in AUC is only 2% worse, as seen in Table 4.

3 http: //www. ics .uci . edu/~iporteou/f astlda/ 
4 http: //www. ilab . sztaki .hu/~ibiro/linkedLDA/ 

Gibbs sampler       LDA     linked LDA 
plain               1000    1303 
aggregated          193     1006 
limit               190     970 
sparse (ℓ = 2)      135     402 
sparse (ℓ = 5)      105     241 
sparse (ℓ = 10)     91      171 
sparse (ℓ = 20)     84      129 
sparse (ℓ = 50)     80      107 

Table 1: Average CPU times for one iteration of the Gibbs sampler (in secs) 
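
To make the speedups in Table 1 explicit, the following short sketch divides the plain 
sampler's time by each variant's time. The dictionary layout and the function name speedups 
are illustrative assumptions; the numbers are copied from the table. 

```python
# Seconds per iteration from Table 1: (LDA, linked LDA).
TIMES = {
    "plain": (1000, 1303),
    "aggregated": (193, 1006),
    "limit": (190, 970),
    "sparse (l=2)": (135, 402),
    "sparse (l=5)": (105, 241),
    "sparse (l=10)": (91, 171),
    "sparse (l=20)": (84, 129),
    "sparse (l=50)": (80, 107),
}


def speedups(times, baseline="plain"):
    """Speedup factor of every sampler relative to the baseline sampler."""
    base = times[baseline]
    return {
        name: tuple(b / t for b, t in zip(base, row))
        for name, row in times.items()
    }


SPEEDUPS = speedups(TIMES)
for name, (lda, linked) in SPEEDUPS.items():
    print(f"{name:15s} LDA x{lda:5.2f}   linked LDA x{linked:5.2f}")
```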



model / sampler                   time 
LDA / fast Gibbs [23]             949 
link-PLSA-LDA / var. inf. [21]    19,826 

Table 2: Average CPU times for one iteration for baseline models and samplers (in secs) 



We point out that although we used the same C code as [23], our measurements of fast Gibbs 
sampling show somewhat poorer performance than the results presented in [23], even 
if one takes into account that fast Gibbs sampling has been shown to perform better with a 
large number of topics (k ≈ 1000); with k = 100 topics we measured 2100 sec on average. On 
the other hand, the fast Gibbs sampler gets progressively faster over consecutive iterations: in 
our experiments with k = 30 topics we measured 2656 sec for the first iteration (much worse than 
for plain Gibbs) and 757 sec for the 50th. Thus the calculated average would be better with 
more iterations, for which, however, there is no real need by the observations in Subsection 4.2. 

As the number of iterations with variational inference is usually chosen to be around 50, 
the same as for Gibbs sampling, we feel the above running times for Gibbs sampling and 
link-PLSA-LDA with its variational inference are comparable. 

4.2 Likelihood and convergence of LDA inference 

Figures 3-5 show the convergence of the likelihood and of the AUC for BayesNet, for some 
combinations of LDA and linked LDA models run with various Gibbs samplers. The plots range over 
50 iterations; we stopped inference after every 2 iterations and calculated the AUC 
of a BayesNet classification over the ϑ features, as well as the likelihood (as described in 
Subsections 2.1 and 2.2). The experiments were run on the small corpus, the number of topics was 
k = 30 for all models, and the parameters were as described in Subsection 3.3. 

The high pairwise correlation in Figure 3 between likelihood and AUC, two otherwise quite 
different accuracy measures, is very interesting. This is certainly due to the fact that after the 
topic assignment stabilizes, there is only negligible variance in the ϑ features. This behavior 
indicates that the widely accepted method of stopping LDA iterations right after the likelihood 
has stabilized can be used even if the inferred variables (ϑ in our case) are later input to other 
classification methods. 
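
The stopping criterion mentioned above can be made concrete in several ways. The sketch below 
is one possible formalization, not the rule used in the paper: it declares the likelihood 
stable once its relative change stays below a tolerance for a few consecutive measurements. 
The function name stabilized_at, the tolerance and the patience parameter are all assumptions. 

```python
def stabilized_at(likelihoods, rel_tol=1e-3, patience=2):
    """Index of the measurement at which the series is considered stable.

    The series is stable once the relative change between consecutive
    measurements has stayed below rel_tol for `patience` steps in a row.
    Returns None if that never happens.
    """
    calm_steps = 0
    for i in range(1, len(likelihoods)):
        prev, cur = likelihoods[i - 1], likelihoods[i]
        change = abs(cur - prev) / max(abs(prev), 1e-12)
        calm_steps = calm_steps + 1 if change < rel_tol else 0
        if calm_steps >= patience:
            return i
    return None


# Made-up likelihood measurements, for illustration only.
example = [100.0, 90.0, 85.0, 84.99, 84.98, 84.98]
print(stabilized_at(example))
```

With measurements taken after every 2 iterations, as in the experiments, the returned index 
would have to be multiplied by 2 to obtain an iteration count. 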

The usual choice for the number of Gibbs sampling iterations for LDA is 500-1000. Thus 
it is worth emphasizing that in our experiments both likelihood and accuracy stabilize after 
only 20-30 iterations. This is in accordance with the similar experiments of [TC1HH], which found 
that for plain LDA, likelihood stabilizes after 50-100 iterations, over various corpora. As a 
consequence, we chose 50 as the number of iterations for the next experiments. 

Figure 4 demonstrates that the linked LDA model with plain, aggregated and limit samplers 
outperforms plain LDA by about 1% in likelihood. This gap increases to about 4% after 
applying classifiers to the inferred ϑ topic distributions; see the classification results in 
Subsection 4.3. The aggregated and limit Gibbs boostings result in negligible deterioration in 
the likelihood, though they give a fivefold speedup. 

Figure 3: Correlation of the likelihood (the lower the better) and the AUC for BayesNet (the 
higher the better) for three choices of model / sampler combinations 

Figure 4: Convergence of the likelihood for various models and samplers 

The observation that far fewer iterations suffice for Gibbs sampling, combined with 
our Gibbs boosting methods, suggests that LDA may become a computationally highly efficient 
latent topic model in the future. 

4.3 Comparison with the baseline 

We ran supervised web document classification as described in Subsection 3.2. The results can 
be seen in Tables 3-5; the evaluation metric is AUC, averaged over the 11 big categories. See 
Table 3 for AUC values on the big corpus. The big corpus was too large for the baseline methods 
to terminate, so the comparison with them, on the small corpus, can be seen in Table 4. 
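
For readers unfamiliar with the metric, the sketch below shows one standard way to compute 
AUC for a single category (the fraction of positive/negative pairs that the classifier ranks 
correctly, ties counting one half) and to average it over categories. The experiments 
themselves used Weka; the function names and the toy scores here are assumptions for 
illustration only. 

```python
def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    scores: classifier scores, higher means more likely positive.
    labels: 1 for positive, 0 for negative instances.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


def average_auc(per_category):
    """Unweighted mean of the per-category AUC values."""
    values = [auc(scores, labels) for scores, labels in per_category.values()]
    return sum(values) / len(values)


# Toy scores for two hypothetical categories.
toy = {
    "category A": ([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]),
    "category B": ([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]),
}
print(average_auc(toy))
```

The pairwise loop is quadratic in the number of instances; it is kept for clarity, and a 
rank-based computation would be preferable for large test sets. 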

The tables clearly indicate that applying aggregated, limit and sparse Gibbs sampling with 
sparsity ℓ at most 10 has only a minor negative effect of about 2% on the classification accuracy, 
while giving a significant speedup according to Table 1. Linked LDA outperforms the LDA-based 
categorization for all classifiers, by up to about 4%; the gap is biggest for BayesNet. 
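
The size of the gap between linked LDA and LDA can be read off Table 3 directly. As a small 
worked example, the following sketch computes the relative AUC gain per classifier on the 
big corpus for the plain samplers; the variable and function names are our own. 

```python
# AUC values of the plain samplers on the big corpus, copied from Table 3.
AUC_BIG = {
    "LDA": {"BayesNet": 0.821, "SVM": 0.710, "C4.5": 0.762},
    "linked LDA": {"BayesNet": 0.854, "SVM": 0.723, "C4.5": 0.765},
}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


GAINS = {
    clf: relative_gain(AUC_BIG["linked LDA"][clf], AUC_BIG["LDA"][clf])
    for clf in AUC_BIG["LDA"]
}

for clf, gain in GAINS.items():
    print(f"{clf}: {gain:.1f}% relative AUC gain of linked LDA over LDA")
```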

Table 5 indicates that the χ link weights delivered by the linked LDA model capture 
influence very well, as they improve 2% over tf.idf, 4% over LDA and 3% over linked LDA with the 
cocitation graph, and 3% over link-PLSA-LDA with its own χ weights. This clearly indicates 
that the χ link weights provided by the linked LDA model are a good approximation of the 
topical similarity along links. Reversing the graph influences the results in a quite 
unpredictable way: rev-cocit is somewhat better than cocit, whereas reversion worsens the AUC 
measure if the weights come from linked LDA or link-PLSA-LDA χ values. 

Figure 5: Convergence of the likelihood for the sparse sampler for LDA with various sparsity 
parameters 

model / sampler                    BayesNet    SVM      C4.5 
LDA                                0.821       0.710    0.762 
LDA / aggr.                        0.820       0.684    0.756 
LDA / limit                        0.810       0.695    0.739 
LDA / sparse (ℓ = 10)              0.788       0.669    0.719 
linked LDA                         0.854       0.723    0.765 
linked LDA / aggr.                 0.848       0.711    0.754 
linked LDA / a. sparse (ℓ = 10)    0.837       0.701    0.733 

Table 3: Big corpus. Classification accuracy measured in AUC for LDA and linked LDA under 
various Gibbs sampling heuristics. 'Sparse' for linked LDA refers to the lazy version of the 
aggregated Gibbs sampler. 

Every run (LDA model building and classification, with and without graph stacking) was repeated 
10 times to estimate the variance of the AUC measure. This was at most 0.015 
throughout, so we decided not to quote the values individually. 

5 Conclusion and future work 

In this paper we introduced the linked LDA model, which integrates the flow of influence along 
links into LDA in such a way that each document can be citing and cited at the same time. 
With our strategies to boost Gibbs sampling we were able to apply our model to supervised 
web document classification as a feature generation and dimensionality reduction method. In 
our experiments linked LDA outperformed LDA and other link-based LDA models by about 
4% in AUC. One of our Gibbs sampler heuristics produced a 10-fold speedup with negligible 
deterioration in convergence, likelihood and classification accuracy. Over our data set of Web 
hosts, these boostings outperform the fast Gibbs sampler of [23] in speed to a great extent. We 
also note that our samplers use ideas orthogonal to fast Gibbs sampling [23] and the parallel 
sampling of [22], and so these methods can be used in combination. It would be interesting to 
explore domains other than LDA where our Gibbs sampling strategies can be applied. Limit 
Gibbs sampling makes it possible to have arbitrary non-negative real numbers as word counts in 

model / sampler                    BayesNet    SVM      C4.5 
LDA                                0.817       0.705    0.767 
LDA / aggr.                        0.813       0.691    0.750 
LDA / limit                        0.808       0.669    0.790 
LDA / sparse (ℓ = ?)               0.805       0.654    0.719 
LDA / sparse (ℓ = ?)               0.764       0.649    0.689 
LDA / sparse (ℓ = ?)               0.735       0.624    0.670 
linked LDA                         ?           ?        0.777 
linked LDA / aggr.                 0.849       0.696    0.771 
linked LDA / limit                 0.845       0.688    0.761 
linked LDA / sparse (ℓ = 2)        0.840       0.683    0.758 
linked LDA / sparse (ℓ = 5)        0.836       0.679    0.753 
linked LDA / sparse (ℓ = 10)       0.827       0.673    0.751 
linked LDA / sparse (ℓ = 20)       0.799       0.656    0.726 
linked LDA / sparse (ℓ = 50)       0.768       0.630    0.705 
link-PLSA-LDA / var. inf. [21]     0.827       0.687    0.754 
tf.idf                             0.569       0.720    0.565 

Table 4: Small corpus. Classification accuracy measured in AUC for LDA and linked LDA 
under various Gibbs sampling heuristics as well as the baseline methods. 



a document, instead of the usual tf counts. To this end, we plan to measure whether the accuracy 
of LDA improves if the tf counts are replaced with the pivoted tf.idf counts of [25]. As further 
research, we will investigate possible applications of the linked LDA model to other domains, 
such as web spam filtering. 